Large-scale digital forensic investigation for Windows registry on Apache Spark

In this study, we investigate large-scale digital forensic investigation on Apache Spark using the Windows registry. Because the Windows registry depends on the system on which it operates, existing forensic methods have targeted the registry of a single system. However, analyzing large-scale registry data collected from several Windows systems is a critical issue because it allows us to detect suspiciously changed data by comparing the registries of multiple systems. To this end, we devise distributed algorithms to analyze large-scale registry data collected from multiple Windows systems on the Apache Spark framework. First, we define three main scenarios into which we classify the existing registry forensic studies. Second, we propose an algorithm to load the Windows registry into the Hadoop distributed file system (HDFS) for subsequent forensics. Third, we propose a distributed algorithm for each defined forensic scenario using Apache Spark operations. Through extensive experiments using eight nodes in an actual distributed environment, we demonstrate that the proposed method can perform forensics efficiently on large-scale registry data. Specifically, we perform forensics on 1.52 GB of Windows registry data collected from four computers and show that the proposed algorithms can reduce the processing time by up to approximately 3.31 times as we increase the number of CPUs from 1 to 8 and the number of worker nodes from 1 to 8. Because distributed algorithms on Apache Spark incur inherent network and MapReduce overheads, this improvement in processing performance verifies the efficiency and scalability of the proposed algorithms.


Introduction
In this study, we investigate large-scale digital forensic investigation on Apache Spark using the Windows registry. The Windows registry is a tree-structured database that stores information required by the Windows operating system and the installed programs, such as version information, configurations, and referenced file locations [1]. In the Windows registry, a registry key and its corresponding registry value are stored in the form of a key-value pair. The Windows registry stores critical information, including user accounts and the locations of programs executed when the system boots [2,3]. By manipulating this stored information, attackers can abuse it for cyberattacks such as DLL hijacking, malware persistence, and privilege escalation. This implies that the data stored in the Windows registry are critical evidence for digital forensics to effectively detect cyberattacks. As a result, forensics on the Windows registry is one of the representative forensic types in Windows systems [4][5][6].
In this study, we devise distributed algorithms to analyze large-scale registry data collected from a number of Windows systems on the Apache Spark framework. Fig 2 shows the overall proposed framework compared to the existing forensic approach on the Windows registry. The existing approach analyzes the Windows registry of a single Windows system. Consequently, it can perform forensic analysis only on a single registry and is limited in identifying malicious entries by comparison with other Windows registry repositories. In contrast, in our framework, we extract the Windows registry from several Windows systems, then transform and load the data into the Hadoop distributed file system (HDFS) on a Hadoop cluster. Then, we perform forensic analysis on the large-scale data files stored on HDFS. Therefore, we can compare multiple Windows registry repositories in a scalable way and identify malicious registry entries based on the comparison.
The contributions of the paper are summarized as follows:
1. We manage and analyze large-scale Windows registry data collected from multiple systems. For this purpose, we present an algorithm that transforms and loads the Windows registry collected from multiple systems into HDFS using Apache Spark operations. This allows forensics on large-scale Windows registry data on a Hadoop cluster, which provides scalable storage for such data.
2. We propose distributed algorithms using Apache Spark operations for forensics on the Windows registry stored on HDFS. We define three main scenarios by classifying the existing registry forensic methods and propose a distributed algorithm for each scenario. The proposed algorithms allow us to perform forensics faster than on a single machine by processing the data in parallel on Apache Spark using multiple nodes in a Hadoop cluster.
3. We use the proposed method to conduct forensics on 1.52 GB of Windows registry data collected from four Windows systems. We show that the proposed method can reduce the processing time by up to approximately 3.31 times as we increase the number of CPUs from 1 to 8 and the number of worker nodes from 1 to 8. Because the proposed algorithms leverage distributed and parallel processing effectively, this result validates their efficiency and scalability.
The remainder of this paper is organized as follows. In Section 2, we explain the Windows registry, MapReduce, and Apache Spark as the background of the paper. In Section 3, we extensively survey and review existing forensic analysis techniques. In Section 4, we define three forensic scenarios and propose an algorithm using Apache Spark operations for each scenario. In Section 5, we present the experimental results. Finally, we present our conclusions in Section 6.

Windows registry
To extract the Windows registry from multiple Windows systems, we need a consistent approach that works across different versions of Windows, including server, PC, and mobile versions. In this study, we extract the registry data from a target Windows system using Regedit [7], the built-in registry editor supported in any version of Windows. Fig 3 shows an example of the exported file. We note that the existing forensic methods for the Windows registry can be applied to the exported text file because it maintains all the information in the original registry, including the hierarchy of registry keys, i.e., the relationship between registry keys and their subkeys [2]. Here, we can collect all the registry keys and values stored in a system simply by specifying the root key of the registry.

MapReduce and Apache Spark
MapReduce is a framework developed by Google for providing distributed and parallel computing of large-scale datasets based on a cluster of multiple nodes connected on the network [8]. The MapReduce framework consists of two stages: 1) map stage, which partitions the entire dataset into multiple data chunks and assigns each data chunk to one node, and 2) reduce stage, which aggregates the sub-results obtained from all the involved nodes as a final result.
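The two stages can be illustrated with a minimal single-process sketch in Python. It is not the authors' code: the sample registry lines, the chunking, and the function names `map_stage` and `reduce_stage` are assumptions made for illustration, and the "nodes" are simply list slices.

```python
from collections import Counter
from functools import reduce

def map_stage(chunk):
    """Count entries per root key within one data chunk (one node's share)."""
    return Counter(line.split("\\", 1)[0] for line in chunk)

def reduce_stage(partials):
    """Aggregate the per-chunk sub-results into the final result."""
    return reduce(lambda a, b: a + b, partials, Counter())

# Hypothetical registry lines in "key path = value" form.
lines = [
    "HKCC\\Software\\Fonts\\Logpixels=dword:00000060",
    "HKU\\Control Panel\\Desktop\\Status=True",
    "HKCC\\System\\CurrentControlSet\\Enabled=dword:00000001",
]
chunks = [lines[:2], lines[2:]]  # two chunks, as if assigned to two nodes
totals = reduce_stage(map_stage(chunk) for chunk in chunks)
```

Because each chunk is processed independently in the map stage, the chunks could be handled by different nodes, and only the small per-chunk counters need to be combined in the reduce stage.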

Apache Spark is an open-source software platform that supports the MapReduce framework [9,10]. Apache Spark works with underlying systems for distributed environments, i.e., HDFS [11] for storing and managing large-scale data files and YARN for managing resources. Because it can manage large-scale data that cannot be stored on a single computer and provides parallel processing using multiple nodes, there have been many research efforts to adapt problems traditionally solved on a single computer to a distributed environment using the Apache Spark platform [12,13]. Hsu et al. [14] reduced the time of processing and mining tweet data that can be used as evidence for drug side effects by up to 2.5 times by partitioning the dataset across two nodes with 12 cores and processing the partitions on Apache Spark. Harine et al. [15] reduced the processing time of machine learning algorithms that predict molecules affecting proteins associated with diseases by up to 14.19 times on Apache Spark using 20 nodes.

Windows forensics
We can classify existing studies on Windows forensics as follows: 1) registry forensics, 2) memory forensics, and 3) application forensics. Registry forensics investigates the data stored in the Windows registry to detect evidence of cyberattacks. Venčkauskas et al. [16] captured the entire registry and examined every stage of the installation, execution, and removal of a certain program. Wong et al. [17] and Farmer et al. [18] investigated recently updated registry keys using the keytime.exe software. Alghafli et al. [19] classified registry keys into categories such as systems, applications, and networks and examined the registry values contained in each category. Verma et al. [20] collected the whole registry using an open-source tool, Regshot, and compared two registry repositories captured at different time points to detect newly installed malicious software. Casey et al. [4] performed forensic analysis on a certain registry key, 'HKCU\Software\RetinaxStudios', to trace a mobile surveillance program, MobileSpy. Rehault [21] classified the registry data in the mobile device based on specific suspicious directories such as 'settings\default\user.hv' and 'settings\system.hv' and demonstrated that forensics for Windows mobile phones can be performed in the same way as for Windows PCs. Roy and Jain [22] examined registry keys created when the PC recognizes a USB device, e.g., 'HKLM\System\ControlSet00x\Enum\USBSTOR\device_class\device_unique_id'. Klaver [23] performed forensic analysis not only on active data but also on deleted data on Windows mobile devices using various recovery techniques such as chip extraction, JTAG, and a boot loader.
Memory forensics investigates data residing in memory to detect malicious actions of running programs. Ruff et al. [5] employed various techniques for capturing real-time data in memory, such as CrashDump and Snapshot, and verified their effectiveness through practical examples. Schuster et al. [24] proposed pool allocation mechanisms to collect volatile data on the current and past processes and performed forensic analysis on them. Canlar et al. [25] proposed a real-time data acquisition method from both random access memory (RAM) and electrically erasable programmable read-only memory (EEPROM) on a Windows mobile device.
Application forensics investigates the artifacts produced by certain applications to detect malicious actions. Gianni and Solinas [6] performed real-time forensic analysis on two different versions of Windows, i.e., Windows XP and Windows 7, targeting common applications such as Skype, Google Talk, and Internet Explorer. Yang et al. [26] analyzed the artifacts that remained after the instant messaging services of Facebook and Skype were run on the Windows operating system. Murphey [27] proposed a method for extracting complete log data for forensics, including not only the normal Windows event log data but also log data that cannot currently be accessed, by repairing and recovering them. Chang et al. [28] examined artifacts remaining after various actions such as installation, uninstallation, log-in, chatting, and file transfer were performed with the LINE application on Windows 10.
Other existing studies on Windows forensics are as follows. Ahmadi et al. [29] minimized the time required for imaging in forensics, which creates a duplicate of the media, by targeting only necessary information such as system logs, Windows registry, and recycle bin instead of the entire data. Yang et al. [30] proposed an efficient forensic method targeting a cloud storage service such as CloudMe.
As described, the existing Windows forensic studies focused on the registry entries extracted from a single system. They usually targeted registry keys that are already known to store suspicious information. In contrast, in this study, we propose a scalable forensic analysis of the Windows registry by introducing a Spark-based forensic framework that compares entire Windows registry repositories collected from multiple systems, focusing on the registry entries that differ between them.

Big data forensics
Zawoad and Hasan [31] proposed a conceptual model for big data forensics based on HDFS and cloud architecture and reduced the entire forensic time by eliminating redundant data from incoming streaming data. Adedayo et al. [32] presented new challenges and opportunities for forensic analysis of large-scale data such as identification, collection, organization, preservation, and presentation. Thaneker et al. [33] proposed a forensic framework on large-scale data that uses Hadoop for storing and managing forensic evidence, and used Autopsy, an open-source forensic tool, for file carving, data carving, and keyword searching.
Qi et al. [34] dealt with forensic analysis of large-scale data by considering four types of NoSQL databases as an alternative to RDBMS: 1) key-value databases, 2) document databases, 3) column-family databases, and 4) graph databases. Then, they evaluated the processing performance of the forensic analysis using two representative NoSQL databases, i.e., MongoDB and Riak, showing that Riak performs better than MongoDB.
In summary, to the best of our knowledge, there have been no research efforts that focus on forensic analysis of the Windows registry on distributed nodes to deal with large-scale data. Instead, as described, previous studies have addressed large-scale data other than the Windows registry. Furthermore, most studies have focused on the concept of big data forensics or the overall framework rather than specific algorithms.

Digital forensics using Apache Spark and MapReduce
There have been few existing studies on forensic analysis using Apache Spark. Hemdan et al. [12] examined large-scale log data collected from the cloud service's web server using Apache Spark to reconstruct cyber crimes. Gonzales et al. [13] proposed a forensic framework based on distributed computing platforms such as Apache Spark and Kafka to collect and detect forensic evidence stored on hard drives. Then, they demonstrated that the distributed computing platform was substantially faster than the standalone version of Autopsy. Chhabra et al. [35] proposed a forensic framework that uses MapReduce to analyze dynamic traffic features from large-scale traffic data generated from IoT devices such as Raspberry Pi for malicious traffic detection. Guarino [36] introduced how big data analysis techniques and algorithms such as MapReduce and decision trees can be adapted to each step of digital forensics: identification, collection, acquisition, preservation, analysis, and reporting.
These existing studies have also performed digital forensics through distributed frameworks, as in our study. However, they have not considered forensic analysis of the Windows registry based on distributed frameworks. In this study, we present, for the first time, Apache Spark-based forensic analysis of the Windows registry and propose a scalable analysis methodology for dealing with multiple Windows registry repositories.

Forensic scenarios
We define three forensic scenarios through an investigation of the existing studies on Windows registry forensics and map the existing studies to these scenarios as follows:
Scenario 1. Forensic for target registry keys.
• Case 1. Performing forensic analysis on a certain registry key, 'HKCU\Software\RetinaxStudios', which is created when the program is installed, to trace MobileSpy, a mobile surveillance program [4]
• Case 2. Performing forensic analysis on system-relevant registry keys (e.g., 'HKLM\Software\Apps', information on the installed software, and 'HKLM\System\Uptime', the time of the last system boot) and user-relevant registry keys (e.g., 'HKCU\ControlPanel\Owner', owner information, and 'HKCU\Software\Microsoft\ActiveSync', the smartphone UID used when syncing with a computer) [21]
• Case 3. Identifying registry keys created when the computer system recognizes a USB device (e.g., 'HKLM\System\ControlSet00x\Enum\USBSTOR\<device_unique_id>') [22]
• Case 4. Classifying the registry keys into systems, applications, and networks, and performing forensic analysis on each category [19]
• Case 5. Classifying the registry keys selected for forensics into three types: hardware, software, and network, and performing forensic analysis on those registry keys [37]
Scenario 2. Forensic for registry keys and values containing target keywords.

Registry data processing
The Windows registry is formed as a tree-based structure consisting of key-value pair entries. However, the data exported from the Windows registry are stored in a text file, as shown in Fig 3. In this section, we present algorithms for manipulating the exported registry data in the form of a tree structure to maintain the original hierarchy of the registry keys.

Converting the registry entry to nested key-value data.
We convert the text data exported from the Windows registry into the form of key-value data so we can access each level of the registry key in the tree structure. For this purpose, we encapsulate each level of the subkey as the value of the key-value data at a nested level. For example, a registry entry, 'HKCC\Software\Fonts\Logpixels = dword:00000060', is converted into the following nested key-value data, '{HKCC: {Software: {Fonts: {Logpixels: dword:00000060}}}}'.
Algorithm 1 shows ConvertRegEntryToNestedKeyValue(), which converts a registry entry into a form of nested key-value data. It receives RegEntry, which is a line of the text data exported from the Windows registry, as the input and returns RegNestedKeyValue, which is the nested key-value data converted from RegEntry. In Lines 1-3, the algorithm separates the registry key and the registry value by '='. Then, because each level of registry subkey is distinguished by '\', the algorithm splits the entire registry key path by '\' and stores the parts into keys as an array while storing the registry value into value. In Lines 4-8, we construct nestedValue by appending each level of the key path in the form of key-value pairs. This process is repeated until all subkeys are appended. The final result stored in RegNestedKeyValue becomes the nested key-value data for the input RegEntry.
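The conversion described above can be sketched in Python as follows. This is a minimal illustration, not the authors' Algorithm 1: the function name is a Python-style rendering, Python dictionaries stand in for the nested key-value data, and the sketch assumes the simplified "key path = value" line format used in the example, with the first '=' separating the key path from the value.

```python
def convert_reg_entry_to_nested_key_value(reg_entry):
    """Turn 'A\\B\\C = v' into {'A': {'B': {'C': 'v'}}}."""
    # Separate the registry key path from the registry value at the first '='.
    key_path, _, value = reg_entry.partition("=")
    # Each level of the key path is delimited by a backslash.
    keys = [key.strip() for key in key_path.split("\\")]
    # Wrap the value level by level, starting from the innermost subkey.
    nested = value.strip()
    for key in reversed(keys):
        nested = {key: nested}
    return nested

example = convert_reg_entry_to_nested_key_value(
    "HKCC\\Software\\Fonts\\Logpixels = dword:00000060"
)
```

Building the structure from the innermost level outward avoids having to track a pointer into a partially built tree.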
Algorithm 1: ConvertRegEntryToNestedKeyValue()
In this section, we present a method that merges registry entries sharing a common registry key path and converts them into a list under that single common key path. This method scans pairs of adjacent registry entries across the entire registry repository and merges them based on their common registry key path. By applying this method to the flattened text file extracted from the Windows registry, we can construct tree-structured data from the text file.
Algorithm 2 shows MergeRegNestedEntries(), which merges registry entries based on the common registry key path. It receives two nested registry entries, RNE 1 and RNE 2, which are obtained by ConvertRegEntryToNestedKeyValue(), as the input and returns the registry entries merged by the common registry key path, RNE 1+2. In Lines 2-8, the algorithm finds the common registry key path of RNE 1 and RNE 2 by comparing each level of RNE 1 and RNE 2 from the root key until the subkeys differ. The common registry key path is stored in keys. In Line 9, the remaining registry keys and values in RNE 1 and RNE 2 are stored in a list for the result value. In Lines 10-12, it combines the common path and the list into one structure, RNE 1+2.
The next step after the result in Fig 5(b) merges RNE 1+2 with another registry entry, RNE 3. '{HKU:{Control Panel:{}}}' becomes the common registry key path, and the remaining subkey and registry value in RNE 3, '{Desktop:{Status:True}}', are added to RNE 1+2. By repeatedly applying MergeRegNestedEntries() to adjacent registry entries, we can finally construct a tree structure.
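The merging idea can be sketched in Python as below. Note the substitution: instead of the list-based structure of Algorithm 2, this sketch uses a recursive dictionary merge, which follows the common key path implicitly by descending while both sides hold a dictionary under the same key. The function name and the sample 'Colors' entry are assumptions for illustration; the 'Desktop' entry mirrors the example in the text.

```python
def merge_reg_nested_entries(rne1, rne2):
    """Merge two nested entries along their common registry key path."""
    merged = dict(rne1)  # shallow copy so the inputs are not modified
    for key, sub in rne2.items():
        if isinstance(merged.get(key), dict) and isinstance(sub, dict):
            # Same subkey on both sides: keep descending the common path.
            merged[key] = merge_reg_nested_entries(merged[key], sub)
        else:
            # New branch, or a clash on a leaf (assumption: the later entry wins).
            merged[key] = sub
    return merged

rne_1_2 = {"HKU": {"Control Panel": {"Colors": {"Window": "255 255 255"}}}}
rne_3 = {"HKU": {"Control Panel": {"Desktop": {"Status": "True"}}}}
tree = merge_reg_nested_entries(rne_1_2, rne_3)
```

When no two entries disagree on a leaf, this merge gives the same tree regardless of the order in which entries are combined, which matters when the merge is later applied in parallel.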

Comparing registry entries.
It is worthwhile to detect the differences between two entire Windows registry repositories in the following two cases, each of which is crucial evidence for forensics: 1) a certain registry key and the associated registry value exist in only one Windows registry, or 2) the same registry key has different values. In this section, we present an algorithm to compare one registry entry from a registry repository with another registry repository. This will be used as the basic function for the algorithm to compare two entire registry repositories using Apache Spark operations in Section 4.4.3.
Algorithm 3 shows ComparingRegEntries(), which compares a registry entry from a registry repository with the entire registry entries from another registry repository. Here, a registry entry exported from one registry repository, RegNestedEntry, is compared with the entire registry entries exported from another registry repository, RegRepository. RegNestedEntry is nested key-value data converted from a registry entry by calling ConvertRegEntryToNestedKeyValue(). RegRepository is the entire Windows registry data where all the registry entries are converted into nested key-value data and then transformed into a single tree structure by applying MergeRegNestedEntries() to adjacent registry entries.
The algorithm compares RegNestedEntry with each entry in the RegRepository in the while loop. The algorithm identifies two suspicious cases. First, the registry key of RegNestedEntry does not exist in the RegRepository. Second, the registry key of RegNestedEntry exists in the RegRepository, but their registry values are different. In Lines 5-11, it checks the registry keys between two entries. Because all the subkeys in each level of the registry repository are maintained in RegRepository, in Line 6, we can easily check if each level of the registry subkey in RegNestedEntry exists in RegRepository. In Lines 7-8, we inspect the next level of registry subkey only if the previous registry subkeys are the same. If the algorithm finds different subkeys between them, it appends RegNestedEntry to CompResult. If the algorithm reaches Line 13, this means that the registry key of RegNestedEntry exists in RegRepository. In Lines 13-18, the algorithm checks if their registry values are different. If the value of RegNestedEntry is different from that of the corresponding entry in the RegRepository, it appends RegNestedEntry to CompResult.
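The two suspicious cases can be sketched in Python as follows. This is an illustration under assumptions rather than the authors' Algorithm 3: the function name and the returned labels ('different_key', 'different_value') are hypothetical, and the sketch returns a label instead of appending the entry to CompResult.

```python
def compare_reg_entry(reg_nested_entry, reg_repository):
    """Return None if the entry matches the repository tree, else a label."""
    node_e, node_r = reg_nested_entry, reg_repository
    # Walk down the single key path of the entry, level by level.
    while isinstance(node_e, dict):
        key, node_e = next(iter(node_e.items()))
        if not isinstance(node_r, dict) or key not in node_r:
            return "different_key"  # case 1: key path absent from the repository
        node_r = node_r[key]
    # The whole key path exists; compare the registry values.
    return None if node_e == node_r else "different_value"  # case 2

repository = {"HKU": {"Control Panel": {"Desktop": {"Status": "True"}}}}
same = {"HKU": {"Control Panel": {"Desktop": {"Status": "True"}}}}
changed = {"HKU": {"Control Panel": {"Desktop": {"Status": "False"}}}}
unknown = {"HKU": {"Software": {"Updater": {"Run": "x.exe"}}}}
labels = [compare_reg_entry(e, repository) for e in (same, changed, unknown)]
```

Because the repository is held as a tree, each level of the entry is checked with one dictionary lookup, so the cost per entry depends on the depth of its key path rather than on the size of the repository.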

Loading registry data into Hadoop distributed file system
To analyze the Windows registry data based on the Apache Spark framework, we need to load the Windows registry data into HDFS, which is the underlying storage for Apache Spark. In this section, we present an algorithm to do so using Apache Spark operations. Algorithm 4 shows the algorithm for loading the Windows registry into HDFS. The inputs of the algorithm are the registry data exported from Regedit, ExportedData, and the desired number of partitions, nPartitions. The detailed steps of Algorithm 4 are as follows. In Line 2, the algorithm converts the entire Windows registry data into an RDD that can be accessed on a distributed cluster environment using parallelize(). In Line 3, it repartitions the entire RDD into nPartitions partitions using repartition(). This operation is used to control the actual number of partitions in the experiments. In Line 4, the algorithm uses flatMap() to separate the Windows registry data by the newline character because Regedit exports each entry of the registry data as a single line. In Line 5, it converts each registry entry into the form of nested key-value data using map() by calling the function ConvertRegEntryToNestedKeyValue(). In Line 6, it merges the registry entries pairwise using reduce() by calling the function MergeRegNestedEntries(). Finally, the result, regRDD, becomes a single tree structure stored on HDFS.
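The sequence of operations can be emulated locally in plain Python to show the data flow. The sketch below does not use Spark: list slicing stands in for repartition(), a list comprehension for flatMap(), and `functools.reduce` for reduce(). The helper names and sample data are assumptions, and the merge is the recursive dictionary merge used for illustration rather than the list-based Algorithm 2.

```python
from functools import reduce

def convert(entry):
    key_path, _, value = entry.partition("=")
    nested = value.strip()
    for key in reversed([k.strip() for k in key_path.split("\\")]):
        nested = {key: nested}
    return nested

def merge(a, b):
    out = dict(a)
    for key, sub in b.items():
        both = isinstance(out.get(key), dict) and isinstance(sub, dict)
        out[key] = merge(out[key], sub) if both else sub
    return out

def load_registry(exported_data, n_partitions):
    # flatMap(): split every exported blob into one line per registry entry.
    lines = [ln for blob in exported_data for ln in blob.splitlines() if "=" in ln]
    # repartition(): distribute the lines over n_partitions partitions.
    partitions = [lines[i::n_partitions] for i in range(n_partitions)]
    # map() + reduce(): convert and merge within each partition, then across them.
    partials = [reduce(merge, map(convert, part), {}) for part in partitions]
    return reduce(merge, partials, {})

exported = [
    "HKCC\\Software\\Fonts\\Logpixels=dword:00000060\n"
    "HKU\\Control Panel\\Desktop\\Status=True\n"
    "HKU\\Control Panel\\Colors\\Window=255 255 255"
]
tree = load_registry(exported, n_partitions=2)
```

Spark's reduce() expects a commutative and associative function because partitions are combined in no fixed order; the emulation makes this visible by merging per partition first, and the resulting tree does not depend on the number of partitions as long as no two entries share the same full key path.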

Forensic analysis on Windows registry on Apache Spark
In this section, we propose algorithms using Apache Spark operations to perform forensic analysis on the Windows registry for the three scenarios defined in Section 4.1.

Forensic for target registry keys.
Algorithm 5 shows ForensicForTargetRegKey(), which retrieves the registry entry for a target registry key. The inputs of the algorithm are the registry data exported from Regedit, ExportedData, the desired number of partitions, nPartitions, and a target registry key for forensics, TargetRegKey. The detailed steps of Algorithm 5 are as follows. Here, the sequence of operations in Lines 2-5, i.e., parallelize(), repartition(), flatMap(), and map(), which transforms each registry entry into an RDD of nested key-value data, is the same as in Algorithm 4. In Line 6, the algorithm finds the registry entry whose registry key path exactly equals TargetRegKey.
As an example, suppose that a registry key path, 'HKCR\...\App\MSPaint.exe\content', is given as TargetRegKey. In Lines 4-5, we process the same steps as in Algorithm 4. In Line 6, we find the registry entry whose registry key is the same as TargetRegKey.
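The exact-match lookup can be sketched in plain Python as below. It is a local illustration, not the authors' Algorithm 5: it works on flat (key path, value) pairs rather than on an RDD of nested data, and the function names, registry paths, and values are hypothetical.

```python
def split_entry(line):
    """Split 'key path = value' into a (key_path, value) pair."""
    key_path, _, value = line.partition("=")
    return key_path.strip(), value.strip()

def forensic_for_target_reg_key(lines, target_reg_key):
    # map(): parse every line; filter(): keep exact key-path matches only.
    entries = map(split_entry, lines)
    return list(filter(lambda kv: kv[0] == target_reg_key, entries))

# Hypothetical exported lines; the paths and values are made up.
lines = [
    "HKLM\\SOFTWARE\\Vendor\\App\\Version=1.0",
    "HKLM\\SOFTWARE\\Vendor\\App\\VersionOld=0.9",
    "HKLM\\SOFTWARE\\Vendor\\Other\\Version=2.0",
]
hits = forensic_for_target_reg_key(lines, "HKLM\\SOFTWARE\\Vendor\\App\\Version")
```

The comparison here is case-sensitive. Windows treats registry key names case-insensitively, so a practical variant might normalize the case of both sides before comparing.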

Forensic for registry entries containing keywords.
Algorithm 6 shows ForensicRegEntriesUsingKeywords(), which performs forensic analysis for registry entries containing keywords. The inputs of the algorithm are the registry data exported from Regedit, ExportedData, the desired number of partitions, nPartitions, and a target keyword to find, TargetKeyword. The detailed steps of Algorithm 6 are as follows. Here, the sequence of operations in Lines 2-5, i.e., parallelize(), repartition(), flatMap(), and map(), is the same as in Algorithm 4 and Algorithm 5. In Line 6, we find the registry entries whose keys or values contain the given TargetKeyword using filter(). In Line 7, we aggregate all the found registry entries and merge them into tree-structured data using reduce().
Fig 9 shows an example of the process of forensics on registry entries using keywords according to Algorithm 6. Here, a keyword 'MS' is given as TargetKeyword. In Lines 4-5, we process the same steps as in Algorithm 5. In Line 6, we find the registry entries whose registry keys or registry values contain 'MS'. Here, the first result contains 'MS' in the registry key; the second result contains it in both the registry key and the value; the third result contains it in the registry value. In Line 7, we merge them as the nested key-value data.
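The filter-then-merge steps can be sketched in plain Python as follows. This is a local emulation under assumptions, not the authors' Algorithm 6: the helper names and sample entries are hypothetical (chosen to mirror the three kinds of matches described above), matching is a case-sensitive substring test, and the merge is a recursive dictionary merge.

```python
from functools import reduce

def convert(entry):
    key_path, _, value = entry.partition("=")
    nested = value.strip()
    for key in reversed([k.strip() for k in key_path.split("\\")]):
        nested = {key: nested}
    return nested

def merge(a, b):
    out = dict(a)
    for key, sub in b.items():
        both = isinstance(out.get(key), dict) and isinstance(sub, dict)
        out[key] = merge(out[key], sub) if both else sub
    return out

def contains_keyword(line, keyword):
    key_path, _, value = line.partition("=")
    return keyword in key_path or keyword in value

def forensic_reg_entries_using_keywords(lines, target_keyword):
    # filter(): keep entries whose key path or value contains the keyword.
    hits = [ln for ln in lines if contains_keyword(ln, target_keyword)]
    # reduce(): merge the matching entries into one tree.
    return reduce(merge, map(convert, hits), {})

lines = [
    "HKCR\\App\\MSPaint.exe\\content=image",     # keyword in the key
    "HKCR\\App\\MSWord\\path=C:\\MS\\word.exe",  # keyword in key and value
    "HKCU\\Software\\Viewer\\vendor=MS Corp",    # keyword in the value
    "HKCU\\Software\\Viewer\\theme=dark",        # no match
]
found = forensic_reg_entries_using_keywords(lines, "MS")
```

Filtering before converting and merging keeps the merge step proportional to the number of matches rather than to the size of the registry.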

Comparing the entire registry repositories.
Algorithm 7 shows CompareRegRepositories(), which compares two entire Windows registry repositories and finds the differences by extending the ComparingRegEntries() algorithm in Section 4.2.3, which compares a registry entry from one registry repository with another registry repository. The inputs of the algorithm are two registry repositories to compare, i.e., RegRepo 1 and RegRepo 2, and the desired number of partitions, nPartitions. The detailed steps of Algorithm 7 are as follows. In Lines 2-6, the algorithm converts the entire registry repository, RegRepo 1, into the nested key-value data, NestedRegRepo 1. Then, in Lines 7-10, it compares the two entire Windows registry repositories using map() by calling ComparingRegEntries(). In Line 11, the algorithm aggregates all the result registry entries of RegRepo 1 that are different from the corresponding registry entry in RegRepo 2 using reduce() by calling MergeRegNestedEntries().
Algorithm 7: CompareRegRepositories()
Fig 10 shows an example of the process of comparing two entire registry repositories according to Algorithm 7. In Lines 2-6, we construct a tree structure for RegRepo 1. In Lines 7-10, we compare each registry entry in RegRepo 2 with RegRepo 1 and find the differences. In Line 11, we merge them based on the common registry key path.
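The comparison can be sketched end to end in plain Python. The sketch follows the walkthrough of the example above (a tree is built for RegRepo 1, and each entry of RegRepo 2 is checked against it); this direction, the helper names, the result labels, and the sample entries are all assumptions for illustration, and nothing here runs on Spark.

```python
from functools import reduce

def convert(entry):
    key_path, _, value = entry.partition("=")
    nested = value.strip()
    for key in reversed([k.strip() for k in key_path.split("\\")]):
        nested = {key: nested}
    return nested

def merge(a, b):
    out = dict(a)
    for key, sub in b.items():
        both = isinstance(out.get(key), dict) and isinstance(sub, dict)
        out[key] = merge(out[key], sub) if both else sub
    return out

def compare(entry, repo):
    node_e, node_r = entry, repo
    while isinstance(node_e, dict):
        key, node_e = next(iter(node_e.items()))
        if not isinstance(node_r, dict) or key not in node_r:
            return "different_key"
        node_r = node_r[key]
    return None if node_e == node_r else "different_value"

def compare_reg_repositories(repo1_lines, repo2_lines):
    # Build one tree for the first repository.
    tree1 = reduce(merge, map(convert, repo1_lines), {})
    # Check every entry of the second repository against that tree and
    # merge the differing entries, grouped by the kind of difference.
    result = {"different_key": {}, "different_value": {}}
    for entry in map(convert, repo2_lines):
        label = compare(entry, tree1)
        if label:
            result[label] = merge(result[label], entry)
    return result

repo1 = ["HKU\\Control Panel\\Desktop\\Status=True",
         "HKCC\\Software\\Fonts\\Logpixels=dword:00000060"]
repo2 = ["HKU\\Control Panel\\Desktop\\Status=False",
         "HKCC\\Software\\Fonts\\Logpixels=dword:00000060",
         "HKCU\\Software\\Unknown\\Run=evil.exe"]
diff = compare_reg_repositories(repo1, repo2)
```

Since each entry is compared against the tree independently of the others, the loop corresponds to the map() step and could be distributed across partitions, with only the final merge requiring aggregation.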

Experimental environments and data sets
In this section, we measure the processing time of the proposed distributed algorithms for each scenario. Here, we use an actual distributed environment in which eight worker nodes are configured on Apache Spark to show the effectiveness of the algorithms. Distributed processing on Apache Spark has three effects: 1) providing scalable storage by loading the entire dataset into the cluster over multiple nodes, 2) parallel processing using multiple nodes, and 3) parallel processing within a node using multiple threads on multiple cores. For the first effect, we measure the processing time of the algorithm for loading registry data into HDFS proposed in Section 4.3 using up to eight nodes. For the second effect, we measure the processing time of the three scenarios by changing the number of nodes from 1 to 8. For the third effect, we measure the processing time of the three scenarios by changing the number of CPUs from 1 to 8 on each node.
In this paper, we use Google Cloud Platform [42] to build an actual distributed environment where Apache Hadoop 2.10.0 [43] and Apache Spark 2.4.7 [44] are installed. A cluster consists of one master node that manages the overall operations and up to eight worker nodes. Each node is equipped with eight 2.0 GHz vCPUs, 3.75 GB of memory, and a 128 GB disk.
We collected actual Windows registry data from four different systems running Windows operating systems. Table 1 lists the details of the collected Windows registry data. We can extract all the registry entries for a system into a text file with a Windows built-in command "regedit /E" [2]. Hence, we can easily collect the registry entries from all the Windows-installed systems including servers, desktops, and mobile devices, regardless of the Windows versions. The sizes of the Windows registry files collected from the Windows 10 systems depend on the system: 240 MB, 352 MB, 413 MB, and 556 MB, respectively.
The detailed system information that is extracted from the registry is described in the table. We note that our proposed framework can be applied to any Windows registry as long as we extract the registry entries using the command above. Hence, we used the four registries in the table as examples to measure the performance of the proposed algorithms. We note that their sizes vary widely across the systems; Registry 1 is the registry obtained immediately after the Windows system is installed. This implies that a lot of information is added and updated in the registry while the Windows operating system is running. In total, we collected 1.52 GB (1,561 MB) of Windows registry data, in which over 2 million registry keys are stored. Fig 11 shows the processing time of the algorithm proposed for loading the Windows registry into HDFS (see Section 4.3). Here, we measure the processing time while varying the number of nodes and the number of CPUs.

Loading Windows registry into Hadoop distributed file system.
As presented in Fig 11(a), when the number of CPUs is fixed to four, the processing time of the algorithm is reduced by up to approximately 1.73 times as the number of worker nodes increases from one to eight. As presented in Fig 11(b), when the number of worker nodes is fixed to four, the algorithm processing time is reduced by up to approximately 2.94 times as the number of CPUs increases from one to eight. Finally, the processing time of the proposed algorithm with eight worker nodes and eight CPUs is reduced by up to approximately 3.31 times compared with the case of one worker node and one CPU. Owing to the overhead of the master node, the performance improvement is not exactly proportional to the number of worker nodes and CPUs. However, the results show a definite improvement in the processing performance of the proposed algorithms as the number of worker nodes and CPUs increases in the Apache Spark framework. The results, therefore, establish the scalability of the proposed algorithm.

Scenario 1: Forensic for the target registry key.
Fig 12 shows the processing time of the algorithm proposed for forensic for a target registry key (see Section 4.4.1). Here, we increase the number of CPUs and the number of worker nodes. We used 'HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\ProfileList', which is associated with user profile information on the computer, as the target registry key. We retrieved the registry entry for the target key from Registry 4 ten times and report the average time.
As presented in Fig 12(a), when the number of CPUs is fixed to four, the processing time is reduced by up to approximately 1.52 times as the number of worker nodes increases from one to eight. As presented in Fig 12(b), when the number of worker nodes is fixed to four, the processing time is reduced by up to approximately 2.35 times as the number of CPUs increases from one to eight. Here again, the results verify the scalability of the proposed algorithm.

As presented in Fig 13(a), when the number of CPUs is fixed to four, the processing time of the algorithm is reduced by approximately 1.3 times as the number of worker nodes increases from one to eight. As presented in Fig 13(b), when the number of worker nodes is fixed to four, the processing time of the algorithm is reduced by up to approximately 1.91 times as the number of CPUs increases from one to eight. Here again, the results verify the efficiency and scalability of the proposed algorithm.

Fig 14 shows the actual results when the keywords 'PHP', 'Exploit', and 'Flash' are each applied to Registry 4. As presented, the algorithm finds all the registry entries that contain a given keyword in the registry key path or the registry value.

As presented in Fig 15(a), when the number of CPUs is fixed to four, the processing time is reduced by up to approximately 1.31 times as the number of worker nodes increases from one to eight. As presented in Fig 15(b), when the number of worker nodes is fixed to four, the processing time is reduced by up to approximately 1.47 times as the number of CPUs increases from one to eight. Once again, the results validate the efficiency and scalability of the proposed algorithm.

Table 2 shows the results of comparing entire registry repositories, where Registry 1 is compared with Registry 2 and with Registry 4.
The results are divided into two categories: 1) the registry key exists only in Registry 1, i.e., different registry keys, and 2) the same registry key exists in both registries, but the registry values differ, i.e., different registry values. When we compare Registry 1 with Registry 4, 243,321 different registry entries are identified, which account for about 65.01% of Registry 1. Most cases fall into the second category, i.e., different registry values. Specifically, the first category accounts for only 0.71%, while the second category accounts for 99.28%. When we compare Registry 1 with Registry 2, 80,141 different registry entries are identified, which account for about 21.41% of Registry 1. Here again, most cases fall into the second category.

Fig 17 compares the processing time of the proposed distributed algorithms on Apache Spark with that of the same algorithms on a single node. We measure the single-node processing time by configuring all the Apache Spark components on that node, which eliminates the network overhead between nodes. For the distributed algorithms, we present the results for the best settings of our distributed configuration, in terms of the number of CPU cores and worker nodes, found in the previous experiments, i.e., 8 worker nodes with 4 CPU cores each and 4 worker nodes with 8 CPU cores each. Here, we measure the processing time of four algorithms in total: one for loading the Windows registry into HDFS and one for each of the three scenarios for forensic analysis on the Windows registry.
The experimental results indicate that the proposed distributed algorithms outperform their single-node counterparts. Specifically, as shown in Fig 17(a), the processing time of loading the Windows registry into HDFS with the proposed distributed algorithm is reduced by 1.77 to 4.45 times compared to that on a single node. As presented in Fig 17(b), the processing time of forensics for the target registry key is reduced by 2.36 to 2.9 times compared to that on a single node. As presented in Fig 17(c), the processing time of forensics on registry entries is reduced by 1.83 to 2.18 times compared to that on a single node. As presented in Fig 17(d), the processing time of comparing entire registry entries is reduced by 1.59 to 1.64 times compared to that on a single node. Because the distributed algorithms on Apache Spark incur inherent network and MapReduce overheads, this improvement in processing performance verifies the efficiency and scalability of the proposed algorithms. Because these gains are already visible with only eight nodes in a distributed environment, we expect that the performance can be improved further by adding more nodes.
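The two comparison categories reported for Table 2 (keys that exist only in the first registry, and keys that exist in both registries with different values) can be sketched in plain Python as follows. This is a single-machine illustration under stated assumptions, not the distributed algorithm evaluated above: each registry is modelled as a dictionary with one value per key path, and the function names `compare_registries` and `share` as well as the toy data are hypothetical.

```python
def compare_registries(base, other):
    """Compare two registries given as {key_path: value} dictionaries.

    Returns (only_in_base, different_values):
      only_in_base     - key paths present in `base` but absent from `other`
      different_values - key paths present in both whose values differ
    Simplifying assumption: one value per key path.
    """
    only_in_base = sorted(k for k in base if k not in other)
    different_values = sorted(
        k for k in base if k in other and base[k] != other[k]
    )
    return only_in_base, different_values


def share(count, total):
    """Percentage of `total` represented by `count`, rounded to 2 decimals."""
    return round(100.0 * count / total, 2) if total else 0.0


# Hypothetical toy registries, for illustration only.
registry_1 = {
    r"HKLM\SOFTWARE\Example\A": "1",
    r"HKLM\SOFTWARE\Example\B": "2",
    r"HKLM\SOFTWARE\Example\C": "3",
    r"HKLM\SOFTWARE\Example\D": "4",
}
registry_4 = {
    r"HKLM\SOFTWARE\Example\A": "1",
    r"HKLM\SOFTWARE\Example\B": "20",
    r"HKLM\SOFTWARE\Example\C": "30",
    r"HKLM\SOFTWARE\Example\E": "5",
}

only_in_1, changed = compare_registries(registry_1, registry_4)
differing = len(only_in_1) + len(changed)
differing_share = share(differing, len(registry_1))
```

With the toy data, one key exists only in `registry_1` and two keys have changed values, so three of the four entries in `registry_1` differ. Keys that exist only in the second registry (here `...\E`) are not counted, mirroring the one-directional categories used in the text.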

Conclusions
In this study, we investigated large-scale digital forensic investigations on Apache Spark using a Windows registry. We devised distributed algorithms to analyze large-scale registry data collected from a number of Windows systems on the Apache Spark framework. First, we defined three main scenarios into which we classified the existing registry forensic studies. Second, we proposed an algorithm to load the Windows registry into HDFS for subsequent forensics. Third, we proposed a distributed algorithm for each defined forensic scenario using Apache Spark operations. Extensive forensic experiments performed on 1.52 GB of Windows registry data collected from four computers showed that the proposed method can reduce the processing time by up to about 3.31 times as the number of CPUs was increased from 1 to 8 and the number of worker nodes from 2 to 8, validating the efficiency and scalability of the proposed algorithms.

Furthermore, we have shown the effectiveness of the framework in integrating and analyzing large-scale Windows registries collected from multiple operating systems. For this purpose, we managed them on a scalable distributed file system and proposed an efficient algorithm for extracting the Windows registry and transforming it into HDFS. We have also shown that the efficiency of the proposed forensic algorithms using Apache Spark operations improves as the parallelism of the cluster (i.e., the number of worker nodes and CPU cores) increases. Consequently, we have shown that the proposed framework can store and manage a large number of Windows registry repositories, which cannot be stored in a single system or a small number of systems, and can analyze them at high speed. Additionally, we targeted the registry in Windows PCs to demonstrate the effectiveness of the proposed method. However, Windows registries are used in every environment that supports the Windows operating system, including mobile devices and cloud services. This means that the proposed approach can be applied to forensic analysis of registries obtained from different environments.